Abstract



Given a graph where vertices represent alternatives and arcs represent pairwise comparison 
data, the statistical ranking problem is to find a potential function, defined on the vertices, 
such that the gradient of the potential function agrees with the pairwise comparisons. Our 
goal in this paper is to develop a method for collecting data for which the least squares 
estimator for the ranking problem has maximal information. Our approach, based on ex- 
perimental design, is to view data collection as a bi-level optimization problem where the 
inner problem is the ranking problem and the outer problem is to identify data which max- 
imizes the informativeness of the ranking. Under certain assumptions, the data collection 
problem decouples, reducing to a problem of finding graphs with large algebraic connec- 
tivity. This reduction of the data collection problem to graph-theoretic questions is one of 
the primary contributions of this work. As an application, we study the 2011-12 NCAA 
football schedule and propose schedules with the same number of games which are signif- 
icantly more informative. Using spectral clustering methods to identify highly-connected 
communities within the division, we argue that the NCAA could improve its notoriously 
poor rankings by simply scheduling more out-of-conference games. 

Keywords: Bayesian experimental design, graph synthesis, algebraic connectivity, sta- 
tistical ranking, scheduling. 

1. Introduction 

The problem of statistical ranking$^1$ arises in a variety of applications (and in particular
competitive sports), where a collection of alternatives (teams) is to be ranked based on pairwise
comparisons (games). Methods for ranking must address a number of inherent difficulties,
including (1) incomplete data (not all teams play all other teams), (2) inconsistent data
(team A beats team B, team B beats team C, and team C beats team A), and (3) imbalanced
data (the "strength of schedule" varies amongst the teams). Ranking from pairwise comparison
data is an old problem David (1963), but, despite and possibly as a consequence of these
difficulties, there have been several recent contributions to the subject Langville and Meyer
(2012); Osting et al. (2012); Hirani et al. (2011); Jiang et al. (2010); Callaghan et al. (2007).
In this work, we adopt the language associated with sports rankings (for example, collecting
pairwise data will be referred to as scheduling games), but our results have broader
applicability in data collection, e.g., problems in social networking, game theory, network
security, and logistics.

1. To prevent confusion, we note that we use the term ranking to indicate a numerical score for
each item in a collection, which is sometimes referred to as a rating.

Our central goal in this paper is to investigate the dependence of the ranking problem on
the schedule of play, which we denote by $w$. When viewed as a statistical inverse problem,
the ranking problem is to estimate the overdetermined parameter which describes each
team's strength (ranking), $\phi$, given (noisy) observations, $y$, which represent the margin
of victory for all scheduled games. Symbolically, an estimator for the ranking problem is
expressed as

$$\hat{\phi}_w = \mathcal{R}(y, w), \qquad (1)$$

where the dependence of the ranking, $\hat{\phi}_w$, on the schedule, $w$, is emphasized by the
subscript.

Generally speaking, the more games played amongst a fixed number of teams (i.e., the
more pairwise data gathered), the more informative we expect the ranking, $\hat{\phi}_w$, to be.
That is, there is a tradeoff between the number of games played and the value of the ranking.
For example, in a single elimination tournament with $n$ teams, there are only $n-1$ games
played. Here, we expect that the "best team" wins the tournament, but it is difficult to rank
the remaining teams in any reasonable way. At the other extreme, a round-robin tournament
amongst $n$ teams requires $\binom{n}{2}$ games, which may not be possible if $n$ is large. In
this paper, we consider the following question: For $n$ teams playing $m$ games, with
$n-1 < m < \binom{n}{2}$, how can the schedule be arranged to produce the "most informative"
ranking?

We follow the methodology of the optimal design community Haber et al. (2008);
Pukelsheim (2006); Melas (2006); Fedorov (1972), and consider the Fisher information for
the ranking estimate $\mathcal{R}$, defined in (1). Assuming $\mathcal{R}$ is unbiased, i.e.,
$\mathbb{E}\hat{\phi}_w = \phi$, the Fisher information is the inverse of the variance,
$\mathrm{Var}(\hat{\phi}_w)$, and thus maximizing the informativeness of the ranking is equivalent
to minimizing $\mathrm{Var}(\hat{\phi}_w)$. We are thus led to the following optimization problem:

$$\min_{w} \; f(\mathrm{Var}(\hat{\phi}_w)) \qquad (2a)$$
$$\text{such that} \quad \hat{\phi}_w = \mathcal{R}(y, w) \qquad (2b)$$
$$w \in \mathbb{Z}_+^N, \quad \|w\|_1 = m \qquad (2c)$$

where $N := \binom{n}{2} = n(n-1)/2$ and $f\colon S^n \to \mathbb{R}$ is a convex function. For
general optimal design problems, common choices for the scalar function $f(A)$ include the
following:

$$f(A) = \|A\|_2 = \max_i \lambda_i(A) \qquad \text{E-optimal} \qquad (3a)$$
$$f(A) = \mathrm{tr}\, A = \sum_i \lambda_i(A) \qquad \text{A-optimal} \qquad (3b)$$
$$f(A) = \det A = \prod_i \lambda_i(A) \qquad \text{D-optimal} \qquad (3c)$$

where $\{\lambda_i(A)\}_{i=1}^n$ denote the eigenvalues of $A$. The constraint in (2c) specifies that
the schedule consists of $m$ games.
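As a small illustration of (3), the three criteria can be evaluated from the eigenvalues of a symmetric matrix. The following is a minimal sketch using NumPy; the function name `design_criteria` and the 2x2 example matrix are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def design_criteria(A):
    """Evaluate the E-, A-, and D-optimality functions of (3) for a
    symmetric positive semidefinite matrix A (hypothetical helper)."""
    eigvals = np.linalg.eigvalsh(A)   # real eigenvalues in ascending order
    return {
        "E": float(eigvals.max()),    # equals the spectral norm when A is PSD
        "A": float(eigvals.sum()),    # trace of A
        "D": float(eigvals.prod()),   # determinant of A
    }

# Hypothetical covariance matrix with eigenvalues 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
crit = design_criteria(A)
```

All three values shrink as the estimator becomes more precise, which is why each is minimized in (2).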
A schedule can be represented as a graph, G = (V,E), with n nodes, denoted V,
representing teams and m edges, denoted E, representing games. The schedule, w, is then 
an integer valued function on the edges with components defining the number of games 
played between the two incident teams. In §4, we show that for the least squares estimator, 
the constraint (2b) in the optimization problem (2) decouples, yielding a graph synthesis 
problem of finding the graph whose graph Laplacian has desired spectral properties. For 
example, an E-optimal schedule corresponds to a graph with maximal second Laplacian 
eigenvalue. 

Current practice There are large variations in the methods currently used for sports 
scheduling. Here, we describe the type of scheduling which is the focus of this work, begin- 
ning with the following distinction: 

• static scheduling: The schedule is determined prior to the season, independent of 
the performance of teams throughout the season. Examples of leagues employing 
static schedules include NCAA football and Major League Baseball (MLB). 

• dynamic scheduling: The schedule dynamically changes based on score results. For 
example, in a single elimination tournament, a team advances to the next round only 
if they win in the current round. Leagues which partially rely on single elimination 
tournaments include ATP tennis and FIFA World Cup soccer. 

While dynamic schedules incorporate the results of previous games and are thus more infor- 
mative than static schedules, they have the disadvantage that they may not be completely 
determined prior to the season. In this paper, we focus only on static scheduling. 

For statically scheduled games, the most important quantity is the ratio of the total
number of games played to the total number of teams. In MLB, there are 30 teams,
divided into two leagues: the American League (14 teams) and the National League (16
teams). During the regular season, each team plays approximately 160 games, primarily
against teams within the same division. Thus, within each league, each pair of teams plays an
average of $160/15 \approx 10$ times. With so many games and equal strength of schedule amongst
teams, it is intuitive that the scheduling has little effect on the rankings. In fact, MLB
simply uses win/loss percentages for ranking purposes. In NCAA football, however, there are
120 teams in the NCAA Football Bowl Subdivision (FBS) and each team plays approximately 6
games per year within FBS. Thus each team only plays roughly 5% of the other teams. There are
several rankings for NCAA football which are generated either mathematically or by expert
opinion and then aggregated to determine official rankings and select teams to compete in
the prestigious end-of-season "bowl games". The fact that these rankings generally disagree
and that none of them is more reliable than the others suggests that none of them is very
informative. It is in this situation, where there are relatively few games compared to the
number of teams, that the schedule has a large effect on the rankings. In this paper,
we construct schedules for which the associated rankings are significantly more informative
than those obtained from the NCAA football schedule.

Outline In §2, we review related work. In §3, we review properties of the eigenvalues of 
the graph Laplacian and establish notation used in subsequent sections. In §4 we study 
the schedule design problem (2) and show the reduction of (2) to a graph synthesis prob- 
lem. In §5, we conduct a number of numerical experiments to demonstrate properties of 
nearly-optimal and randomly generated schedules. These are compared with the 2011-12
NCAA Division 1 football schedule. Finally, we conclude in §6 with a discussion of further 
directions. 

2. Related work 

Our work is related to three subject areas, which we discuss in turn: statistics and experi- 
mental design, sports scheduling, and graph theory. 

Statistics and experimental design Excellent surveys of the optimal experiment design 
literature can be found in Haber et al. (2008); Pukelsheim (2006); Melas (2006); Fedorov 
(1972). 

Methods of optimal experiment design have been applied to ill-posed inverse problems, 
e.g., in geophysical Haber et al. (2008) or biomedical imaging (Horesh et al., 2011, ch. 13, 
p. 273-290), Chung and Haber (2012); Quinn and Keough (2002); DiStefano 3rd (1976). It 
is instructive to consider the analogy between these applications and the scheduling design 
problem considered here. In imaging systems, there is a tradeoff between the amount of 
collected data and the accuracy of the reconstruction, or equivalently, the sparsity of the 
measurement and the uncertainty in the solution to the inverse problem. For application 
dependent reasons (e.g., high radiation dose to a patient or the cost of collecting data), it 
is often desirable to place as few sensors as possible while still maintaining an acceptable 
accuracy in the reconstruction. In sports scheduling, the goal is to construct the best 
ranking possible from a small number of games. In both situations, it is desirable to take 
"measurements" which are maximally informative. 

Sports scheduling: single-elimination tournaments and active learning methods 

Previous work in sports scheduling can be roughly divided into the following two categories. 

The first type of scheduling focuses on the seeding policy of single-elimination tourna- 
ments with the objective of arranging the teams so that the outcome of the tournament 
agrees with a preexisting ranking D'Souza (2010); Scarf and Yusof (2011); Glickman (2008) 
or an arrangement which favors a particular team Vu et al. (2001). These objectives de- 
pend on a preexisting ranking of the teams, which we do not assume to know in this paper. 
Another type of tournament scheme is investigated in Ben-Naim and Hengartner (2007),
where a sequence of rounds of diminishing size are used to determine the best team. 

The second type of scheduling is a dynamic scheduling method. Games are scheduled
to maximize the expected gain in information, and thus one can view the resulting
schedules as the output of a greedy algorithm for learning as much as possible about the
rankings Glickman
(2005). This is the active learning approach, where past observations are used to control
the process of gathering future observations, see, for example, Krause et al. (2008); Seeger 
and Nickisch (2011); Silva and Carin (2012). The current work shares the same objective 
with this second type of scheduling. The difference is that we do not update the expected 
game outcomes as the season progresses. In this case, the schedule can be fixed before the 
season begins. This simplification is significant and leads to a formulation of the scheduling 
problem which has an interesting graph theoretical interpretation. An extension of this 
work is the interpretation of the dynamic scheduling method as a perturbation to the static 
problem. 
Finally, we remark that considerations of the schedule cost lead to variations of the
traveling salesman problem. 

Graph theory In this paper, we reduce the schedule design problem (2) to a graph 
synthesis problem. We focus on the optimality condition given in (3a), which reduces to 
finding graphs with maximal algebraic connectivity. There is a tremendous amount of work 
on the algebraic connectivity of graphs, originating with studies by Miroslav Fiedler Fiedler 
(1973). Many properties of algebraic connectivity are reviewed in Mohar (1991); Biyikoglu 
et al. (2007) and we also review some of these results in §3. The problems arising from the 
other optimality conditions, (3b) and (3c), are less well studied Grimmett (2010); Ghosh 
and Boyd (2006a); Ghosh et al. (2008). 

The robustness of a network to node/edge failures is highly dependent on the algebraic 
connectivity of the graph. Also, the rate of convergence of a Markov process on a graph 
to the uniform distribution is determined by the algebraic connectivity Sun et al. (2004). 
Finally, in the "chip-firing game" of Bjorner, Lovasz and Shor, the algebraic connectivity 
dictates the length of a terminating game Bjorner et al. (1991). Consequently, algebraic 
connectivity is a measure of performance for the convergence rate in sensor networks, data 
fusion, load balancing, and consensus problems Olfati-Saber et al. (2007). 



3. Eigenvalues of the graph Laplacian and algebraic connectivity 

In this section, we briefly survey relevant results on the eigenvalues of the graph Laplacian 
and algebraic connectivity. More extensive treatments are given in Fiedler (1973); Biyikoglu 
et al. (2007); Mohar (1991); Chung (1997). In §3.1, we recall algorithms for computing 
graphs with large algebraic connectivity Ghosh and Boyd (2006b). 

Let $B \in \mathbb{R}^{N \times n}$, where $N := \binom{n}{2}$, be the arc-vertex incidence matrix
for the complete directed graph on $n$ nodes,

$$B_{k,j} = \begin{cases} 1 & j = \mathrm{head}(k) \\ -1 & j = \mathrm{tail}(k) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where the arc orientations (heads and tails of arcs) have been chosen arbitrarily. Let
$G = (V,E)$ be a graph with $|V| = n$. It is convenient to represent the edge set, $E$, by a
vector $w \in \{0,1\}^N$, such that $w_k = 1$ if the $k$-th edge is present and $w_k = 0$
otherwise. We refer to $w$ as the indicator function of the edge set. The graph Laplacian of
$G$ is defined

$$\Delta_w := B^t W B \quad \text{where } W = \mathrm{diag}(w).$$
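The construction of the incidence matrix (4) and of the Laplacian from an edge indicator can be sketched as follows. This is an illustration under stated assumptions: the pairs are enumerated in lexicographic order and each arc is oriented from the lower-indexed node (tail) to the higher-indexed node (head), which is one arbitrary choice among many. The function names are hypothetical.

```python
import itertools
import numpy as np

def incidence_matrix(n):
    """Arc-vertex incidence matrix B (N x n) of the complete graph on n nodes.

    Arc k joins the k-th pair (i, j) with i < j; the orientation
    (tail i, head j) is an arbitrary choice, as allowed by (4).
    """
    pairs = list(itertools.combinations(range(n), 2))
    B = np.zeros((len(pairs), n))
    for k, (i, j) in enumerate(pairs):
        B[k, j] = 1.0    # head of arc k
        B[k, i] = -1.0   # tail of arc k
    return B, pairs

def laplacian(n, edges):
    """Graph Laplacian B^t W B for an edge set given as pairs (i, j)."""
    B, pairs = incidence_matrix(n)
    edge_set = {tuple(sorted(e)) for e in edges}
    w = np.array([1.0 if p in edge_set else 0.0 for p in pairs])
    return B.T @ np.diag(w) @ B

# Path on 3 nodes: 0 - 1 - 2.
L = laplacian(3, [(0, 1), (1, 2)])
```

The result coincides with the familiar "degree matrix minus adjacency matrix" form of the Laplacian, independently of the chosen orientations.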

Let $\lambda_i(w)$ for $i = 1, \ldots, n$ denote the eigenvalues of $\Delta_w$. The eigenvalues are
contained in the interval $[0, \min(n, 2d_+)]$ where $d_+$ is the maximum degree of $G$. The first
eigenvalue of $\Delta_w$, $\lambda_1$, is zero with corresponding eigenvector $v_1 = 1$. The second
eigenvalue, $\lambda_2$, is nonzero if and only if the graph is connected. The second eigenvalue is
referred to as the algebraic connectivity of $G$ and is characterized by

$$\lambda_2(w) = \min_{\|v\| = 1,\ \langle v, 1 \rangle = 0} \|Bv\|_{2,w}^2, \qquad (5)$$

where $\|x\|_{2,w}^2 := x^t W x$ denotes the $w$-weighted norm. The eigenvector $v_2$ corresponding
to $\lambda_2$ is sometimes called the Fiedler vector after Miroslav Fiedler for his contribution
to the subject Fiedler (1973).

Let $w_i \in \{0,1\}^N$ for $i = 1, 2$ be the indicator functions for two edge sets $E_i$ defined
on a vertex set $V$. It follows from (5) that $w_1 \le w_2$ implies $\lambda_2(w_1) \le \lambda_2(w_2)$.
That is, for a fixed vertex set, $E_1 \subseteq E_2$ implies that the graph with more edges has
algebraic connectivity at least as large.
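The statements about the second eigenvalue can be checked numerically on small graphs. The sketch below, with hypothetical helper names, computes the algebraic connectivity and Fiedler vector with `numpy.linalg.eigh` for a 4-node path, for the 4-cycle obtained by adding one edge, and for a disconnected graph.

```python
import numpy as np

def laplacian_from_edges(n, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency matrix)."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

def fiedler(n, edges):
    """Return (lambda_2, v_2): second-smallest eigenvalue and its eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(laplacian_from_edges(n, edges))
    return eigvals[1], eigvecs[:, 1]

path_edges = [(0, 1), (1, 2), (2, 3)]                # path on 4 nodes
lam2_path, v2_path = fiedler(4, path_edges)
lam2_cycle, _ = fiedler(4, path_edges + [(3, 0)])    # add one edge: 4-cycle
lam2_split, _ = fiedler(4, [(0, 1), (2, 3)])         # disconnected graph
```

In this example the path has $\lambda_2 = 2 - \sqrt{2}$, adding the closing edge raises it to 2, and the disconnected graph has $\lambda_2 = 0$ up to rounding.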

Let $U \subset V$ and $\mathrm{cut}(U, U^c)$ be the set of edges connecting $U$ and
$U^c := V \setminus U$. Then the algebraic connectivity is bounded by the normalized graph cut,

$$\lambda_2(w) \le \frac{n\,|\mathrm{cut}(U, U^c)|}{|U|\,|U^c|}. \qquad (6)$$

In particular, if $U = \{v\}$ where $v \in V$ is the node with smallest degree, i.e.,
$d_v = d_-$, then $d_v \le \frac{2m}{n}$ where $m = |E|$ and we obtain

$$\lambda_2(w) \le \frac{n d_-}{n-1} \le \frac{2m}{n-1}. \qquad (7)$$

Properties of graphs for which the bound in (7) is tight have been studied Fallat et al.
(2003). For an incomplete graph, the algebraic connectivity is bounded above by both the
vertex connectivity, $C_v(G)$, and edge connectivity, $C_e(G)$,

$$0 \le \lambda_2 \le C_v \le C_e \le d_-,$$

where $d_-$ is the minimal vertex degree Fiedler (1973). The algebraic connectivity can also
be bounded in terms of Cheeger's inequality, Buser's inequality, and the diameter of the
graph Biyikoglu et al. (2007); Mohar (1991); Chung (1997).
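The degree bound (7) is easy to verify on a few 5-node graphs. The following sketch (the helper name and the choice of example graphs are assumptions made for illustration) returns the algebraic connectivity together with the two upper bounds $n d_-/(n-1)$ and $2m/(n-1)$.

```python
import numpy as np

def lambda2_with_bounds(n, edges):
    """Return (lambda_2, n*d_min/(n-1), 2m/(n-1)) for a simple graph."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    lam2 = np.linalg.eigvalsh(L)[1]      # second-smallest eigenvalue
    d_min = L.diagonal().min()           # minimal vertex degree
    return lam2, n * d_min / (n - 1), 2 * len(edges) / (n - 1)

n = 5
graphs = {
    "cycle": [(i, (i + 1) % n) for i in range(n)],
    "star": [(0, i) for i in range(1, n)],
    "complete": [(i, j) for i in range(n) for j in range(i + 1, n)],
}
results = {name: lambda2_with_bounds(n, e) for name, e in graphs.items()}
```

For the complete graph all three quantities coincide, so the bound is attained there; for the star and the cycle it is strict.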



3.1 Finding graphs with large algebraic connectivity 

In several applications, it is useful to compute graphs with large algebraic connectivity, (5).
The problem of finding weights $w \in \mathbb{R}^N$ which maximize $\lambda_2(w)$ is a convex
optimization problem and can be formulated as a semidefinite program (SDP) Ghosh and Boyd
(2006b). However, if $w \in \mathbb{Z}_+^N$, the problem is NP-hard Mosk-Aoyama (2008). This is
the case arising in the schedule optimization problem.

The integer constrained problem may be approximately solved by relaxing the integer
constraint and then rounding the solution. The rounded solution is feasible, so its objective
value is clearly a lower bound on the optimal value and, if the values $w$ are large, a
reasonable approximation. Another approach, advocated by Ghosh and Boyd (2006b); Wang and
Mieghem (2008), is to use the greedy algorithm based on the Fiedler vector described in
Algorithm 1. This algorithm adds a specified number of edges to an input graph, with the aim
of maximizing the algebraic connectivity of the resulting augmented graph. In this work, we
refer to graphs produced via this method as nearly-optimal.
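The greedy heuristic of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions, not the code used for the experiments: the function names are hypothetical, ties between candidate edges are broken by enumeration order, and when $\lambda_2$ is a repeated eigenvalue the Fiedler vector returned by `numpy.linalg.eigh` is only one of several valid choices.

```python
import itertools
import numpy as np

def laplacian_from_edges(n, edges):
    """Unweighted graph Laplacian for an iterable of edges (i, j)."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

def lambda2(n, edges):
    """Algebraic connectivity (second-smallest Laplacian eigenvalue)."""
    return np.linalg.eigvalsh(laplacian_from_edges(n, edges))[1]

def greedy_augment(n, initial_edges, num_new_edges):
    """Add edges one at a time, each time choosing the absent edge {i, j}
    that maximizes (F_i - F_j)^2 for the current Fiedler vector F."""
    edges = {tuple(sorted(e)) for e in initial_edges}
    all_pairs = list(itertools.combinations(range(n), 2))
    for _ in range(num_new_edges):
        candidates = [p for p in all_pairs if p not in edges]
        if not candidates:               # graph is already complete
            break
        _, eigvecs = np.linalg.eigh(laplacian_from_edges(n, edges))
        F = eigvecs[:, 1]                # Fiedler vector of current graph
        best = max(candidates, key=lambda p: (F[p[0]] - F[p[1]]) ** 2)
        edges.add(best)
    return sorted(edges)

n = 6
path = [(i, i + 1) for i in range(n - 1)]    # initial graph: path P_6
augmented = greedy_augment(n, path, 1)       # add a single edge
```

Starting from the path on 6 nodes, the first edge added joins the two endpoints, because the Fiedler vector of a path is monotone along the path; the result is the 6-cycle.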



Algorithm 1: A greedy heuristic for computing graphs with large algebraic connectivity
Ghosh and Boyd (2006b); Wang and Mieghem (2008). See §3.1.
Data: An initial graph $G_i = (V, E_i)$ and an integer number, $M$, of edges to add.
Output: A graph $G = (V, E)$ with large algebraic connectivity with edge set of size
$|E| = |E_i| + M$.
Set $E = E_i$ (current edge set).
for $k = 1$ to $M$ do
    Compute the Fiedler vector $F = \arg\min_{\|v\| = 1,\ \langle v, 1 \rangle = 0} \|Bv\|_{2,w}$.
    Find the edge $\{i,j\} \notin E$ which maximizes $(F_i - F_j)^2$.
    Set $E = E \cup \{\{i,j\}\}$.



4. Optimal scheduling using a least squares ranking

We assume that each team $j = 1, \ldots, n$ has a ranking (measure of strength) given by
$\phi_j$. We label each possible game by $k = 1, \ldots, \binom{n}{2} = N$ and denote by
$B \in \mathbb{R}^{N \times n}$ the arc-vertex incidence matrix (4) for the complete graph. For each
pair of teams $k = \{i,j\}$, we assume that there exists some measure of the outcome of the
games played between teams $i$ and $j$, denoted $y_k$, such that

$$y_k = (B\phi)_k + \epsilon_k \qquad (8)$$

where $\epsilon_k$ is a random variable with zero mean, i.e., $\mathbb{E}\epsilon = 0$. Let
$w_k \in \mathbb{Z}_+$ denote the number of games played between teams $i$ and $j$. We assume
that the variance of $\epsilon_k$ is given by $\sigma^2/w_k$ for some constant $\sigma$. That is,
more games between teams $i$ and $j$ reduce the variance in the observed pairwise comparison.
We have in mind that $y_k$ is a function of the score differences (margin of victory) for the
games played between teams $i$ and $j$, but, surprisingly, our approach does not require this
to be specified precisely. We only require the existence of such a measure of the game
outcomes.

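Ranking There are several choices for the ranking in (1). The Gauss-Markov theorem
states that the least squares estimator,

$$\hat{\phi}_w = \mathcal{R}(y, w) \qquad (9a)$$
$$= \arg\min_{\langle \phi, 1 \rangle = 0} \|B\phi - y\|_{2,w}^2 \qquad (9b)$$
$$= (B^t W B)^\dagger B^t W y, \qquad (9c)$$

is the linear, unbiased ($\mathbb{E}[\hat{\phi}_w] = \phi$) estimator with smallest variance. The
variance of $\hat{\phi}_w$ is given by

$$\mathrm{Var}(\hat{\phi}_w) = \sigma^2 (B^t W B)^\dagger = \sigma^2 \Delta_w^\dagger, \qquad (10)$$

which is proportional to the Moore-Penrose pseudoinverse of the unweighted graph Laplacian.
Equation (10) is shown, using the linearity of the estimator, as follows. We first compute

$$\hat{\phi}_w = (B^t W B)^\dagger B^t W y = (B^t W B)^\dagger B^t W (B\phi + \epsilon) = \phi + (B^t W B)^\dagger B^t W \epsilon.$$

Thus,

$$\mathrm{Var}(\hat{\phi}_w) = \mathbb{E}\left[(\hat{\phi}_w - \phi)(\hat{\phi}_w - \phi)^t\right] = (B^t W B)^\dagger B^t W\, \mathbb{E}[\epsilon \epsilon^t]\, W B (B^t W B)^\dagger.$$

Assuming that $\mathbb{E}[\epsilon \epsilon^t] = \sigma^2 W^{-1}$, the middle factors reduce to
$\sigma^2 B^t W B$, and the pseudoinverse identity $A^\dagger A A^\dagger = A^\dagger$ yields (10).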

The use of the least-squares estimate (9) in ranking has been referred to as HodgeRank
Jiang et al. (2010) and is related to the Massey and Colley methods used in sports rankings
Langville and Meyer (2012). It is the ranking method considered in the present work.
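The least squares estimator (9c) can be sketched directly with a pseudoinverse. The example below is illustrative only: the function name, the input format (a dictionary mapping a pair of teams to the number of games and the observed outcome), and the three-team data are assumptions, and arcs are oriented from the lower-indexed to the higher-indexed team so that an outcome approximates the difference of the two strengths.

```python
import itertools
import numpy as np

def least_squares_ranking(n, results):
    """Evaluate (9c): phi_hat = (B^t W B)^+ B^t W y.

    `results` maps a pair (i, j) with i < j to (w_k, y_k), where w_k is the
    number of games played and y_k is the observed outcome, modeled as
    phi_j - phi_i plus noise. Pairs that never play get weight zero.
    """
    pairs = list(itertools.combinations(range(n), 2))
    B = np.zeros((len(pairs), n))
    w = np.zeros(len(pairs))
    y = np.zeros(len(pairs))
    for k, (i, j) in enumerate(pairs):
        B[k, i], B[k, j] = -1.0, 1.0     # tail i, head j
        if (i, j) in results:
            w[k], y[k] = results[(i, j)]
    W = np.diag(w)
    laplacian = B.T @ W @ B
    return np.linalg.pinv(laplacian) @ B.T @ W @ y

# Hypothetical noise-free data on a connected 3-team schedule.
phi_true = np.array([1.0, 0.0, -1.0])    # strengths with zero mean
results = {
    (0, 1): (1, phi_true[1] - phi_true[0]),
    (1, 2): (2, phi_true[2] - phi_true[1]),
}
phi_hat = least_squares_ranking(3, results)
```

With noise-free outcomes on a connected schedule the estimator returns the zero-mean strengths exactly (up to rounding), which is the unbiasedness property used in the text.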

Schedule Design The schedule design problem (2) is to find $w$ such that the variance
of the estimate $\hat{\phi}_w$ is minimal in the sense of the semi-definite ordering (i.e.,
$A \succeq B$ if $A - B$ is positive semidefinite). It is remarkable that
$\mathrm{Var}(\hat{\phi}_w)$, given in (10), does not depend on the scores, $y$. Thus, the
constraint in the optimal scheduling problem (2b) decouples. Traditional optimality criteria
are functions of the eigenvalues of $\mathrm{Var}(\hat{\phi}_w)$ such as given in (3) Haber
et al. (2008); Pukelsheim (2006); Melas (2006); Fedorov (1972). In what follows, we use
the "E-optimality condition" of minimizing $\lambda_{\max}(\mathrm{Var}(\hat{\phi}_w))$. Using (10),
this is equivalent to maximizing the smallest non-zero eigenvalue of the unweighted graph
Laplacian, $\Delta_w = B^t W B$. For a connected graph, the smallest non-zero eigenvalue is the
second one, $\lambda_2(w)$, defined in (5). Thus the optimal schedule is obtained by solving the
following eigenvalue optimization problem

$$\max_{w} \; \lambda_2(w) \qquad (11)$$
$$\text{such that} \quad w \in \mathbb{Z}_+^N, \quad \|w\|_1 = m.$$

For simplicity, in §5 we will further assume that each pair of teams plays at most once,
i.e., $w \in \{0,1\}^N$. In this case, $\lambda_2(w)$ is the algebraic connectivity of the graph
(see §3), and (11) can be interpreted as the graph synthesis problem of finding a graph with
$n$ nodes and $m$ edges with largest algebraic connectivity. The more general case
corresponding to teams playing one another multiple times, i.e., $w \in \mathbb{Z}_+^N$, can be
interpreted in terms of the algebraic connectivity of a multigraph. This direction is not
further pursued here.
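For very small instances, the graph synthesis problem (11) with $w \in \{0,1\}^N$ can be solved by exhaustive enumeration, which gives a useful sanity check even though it does not scale (the problem is NP-hard in general). The sketch below uses hypothetical function names and the instance $n = 5$, $m = 5$ chosen for illustration.

```python
import itertools
import numpy as np

def lambda2(n, edges):
    """Algebraic connectivity of the simple graph with the given edges."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.linalg.eigvalsh(L)[1]

def best_schedule(n, m):
    """Brute-force solution of (11) over all simple graphs with m edges."""
    pairs = list(itertools.combinations(range(n), 2))
    best_edges, best_val = None, -1.0
    for edges in itertools.combinations(pairs, m):
        val = lambda2(n, edges)
        if val > best_val + 1e-12:       # keep the first maximizer found
            best_edges, best_val = edges, val
    return best_edges, best_val

edges, val = best_schedule(5, 5)         # 252 candidate schedules
degrees = [sum(v in e for e in edges) for v in range(5)]
```

For five teams and five games the maximizer is a 5-cycle, in which every team plays exactly two games, with $\lambda_2 = 2 - 2\cos(2\pi/5) \approx 1.382$.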

We summarize the preceding discussion with the following proposition: 

Proposition 4.1 Let $\epsilon$ be a random vector with $\mathbb{E}\epsilon = 0$ and
$\mathrm{Var}(\epsilon) = \sigma^2 W^{-1}$ where $W = \mathrm{diag}(w)$ and $w \in \mathbb{Z}_+^N$. Let
$\hat{\phi}_w$ be the least squares estimator given in (9) for $\phi$ in (8). The schedule
$w \in \mathbb{Z}_+^N$ with $\|w\|_1 = m$ which minimizes $\|\mathrm{Var}(\hat{\phi}_w)\|_2$ is the
solution of the graph synthesis problem (11).

Remark 1 Other measures of optimality, such as those given in (3), may be used in place
of the objective function in (11). The D-optimal objective function can be interpreted in
terms of the number of spanning trees within the graph Ghosh and Boyd (2006a). The A-optimal
objective function is the total effective resistance of an electric circuit constructed by
identifying each edge of the graph with a resistor of equal resistance Ghosh and Boyd (2006a);
Ghosh et al. (2008) and is related to the return time for a reversible Markov chain Grimmett
(2010).
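Both interpretations in Remark 1 can be computed from the nonzero Laplacian eigenvalues: by the matrix-tree theorem the number of spanning trees of a connected graph equals the product of the nonzero eigenvalues divided by $n$, and the total effective resistance with unit resistors equals $n$ times the sum of their reciprocals. The sketch below is an illustration with a hypothetical function name and assumes a connected simple graph.

```python
import numpy as np

def tree_count_and_resistance(n, edges):
    """Return (number of spanning trees, total effective resistance)
    of a connected simple graph with unit-resistance edges."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    nonzero = np.linalg.eigvalsh(L)[1:]          # drop the zero eigenvalue
    spanning_trees = nonzero.prod() / n          # matrix-tree theorem
    total_resistance = n * (1.0 / nonzero).sum() # sum over all node pairs
    return spanning_trees, total_resistance

complete4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
cycle4 = [(i, (i + 1) % 4) for i in range(4)]
trees_k4, res_k4 = tree_count_and_resistance(4, complete4)
trees_c4, res_c4 = tree_count_and_resistance(4, cycle4)
```

The complete graph on 4 nodes has 16 spanning trees and total resistance 3, while the 4-cycle has 4 spanning trees and total resistance 5, so the denser graph is better under both criteria.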

Remark 2 The Hodge decomposition implies that the residual in (9b), $r = B\phi - y$, can be
further decomposed into two orthogonal components: (1) a divergence-free component which
consists of 3-cycles and (2) a harmonic component which consists of longer cycles Jiang
et al. (2010); Hirani et al. (2011). In fact, Jiang et al. (2010) argue that a dataset which
has a large harmonic component is inherently inconsistent and does not have a reasonable
ranking. The harmonic component lies in the kernel of the graph Helmholtzian, which has
dimension given by the first Betti number of the associated simplicial complex.
Table 1: A comparison of several measures of connectivity for 4 well-known graphs. We
assume $n > 3$. Subscripts on the eigenvalues denote multiplicity and $\lfloor \cdot \rfloor$
indicates the floor function. See §5.1.



5. Numerical experiments 

In this section we study graphs corresponding to schedules which are good for ranking,
and in particular, graphs with large algebraic connectivity. In §5.1, we consider structured
graphs for which the eigenvalues of the Laplacian can be analytically computed and small
graphs with $\le 5$ nodes. In §5.2, we compare the expected algebraic connectivity of
Erdos-Renyi random graphs with graphs obtained using the greedy algorithm described in §3.1.
In §5.3, we discuss the algebraic connectivity for the graph corresponding to the 2011-12
NCAA Division I football schedule.



5.1 Algebraic connectivity for example graphs 

In this section, we give results on the algebraic connectivity for graphs with easily
computable spectra and graphs with a small number of nodes. In Table 1, we tabulate the
eigenvalues, algebraic connectivity (5), edge connectivity, vertex connectivity, and diameter
for 4 well-known graphs. The numbers of distinct $n$-node, connected, unlabeled graphs for
$n = 1, 2, 3, \ldots$ are 1, 1, 2, 6, 21, 112, 853, 11117, 261080, ... (Sloane A001349). In
Fig. 1 we plot, for $n = 4$ and $n = 5$, each of these graphs together with the algebraic
connectivity, $\lambda_2$.

In Fig. 1, we observe that as the number of edges, $m$, is increased, the algebraic
connectivity, $\lambda_2$, generally increases. Furthermore, for a fixed number of edges, $m$, the
algebraic connectivity can vary significantly. For $m = 5$, 6, and 7, the value of $\lambda_2$
varies by a factor $> 2$. For $m = 5$, the graph with smallest $\lambda_2$ has small edge
connectivity (and hence small algebraic connectivity) and the graph with largest $\lambda_2$ has
nodes with equal degree. These small graphs illustrate the bounds given in §3.
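Closed-form values of the algebraic connectivity are available for several standard graph families and can be compared with numerical eigenvalues. The sketch below checks four such families (path, cycle, star, complete graph); this selection is an assumption made for illustration and need not coincide with the four graphs tabulated in Table 1.

```python
import math
import numpy as np

def lambda2(n, edges):
    """Algebraic connectivity of the simple graph with the given edges."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.linalg.eigvalsh(L)[1]

def path(n):
    return [(i, i + 1) for i in range(n - 1)]

def cycle(n):
    return path(n) + [(n - 1, 0)]

def star(n):
    return [(0, i) for i in range(1, n)]

def complete(n):
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

n = 7
closed_form = {
    "path": 2 - 2 * math.cos(math.pi / n),
    "cycle": 2 - 2 * math.cos(2 * math.pi / n),
    "star": 1.0,
    "complete": float(n),
}
builders = {"path": path, "cycle": cycle, "star": star, "complete": complete}
numeric = {name: lambda2(n, build(n)) for name, build in builders.items()}
```

The path has the smallest value and the complete graph the largest, consistent with the monotonicity of $\lambda_2$ under the addition of edges.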
Figure 1: The 4- and 5-node connected graphs and their algebraic connectivity, $\lambda_2$.
Graphs corresponding to more informative schedules have large algebraic connectivity.
See §5.1.

5.2 Algebraic connectivity of Erdos-Renyi random graphs and computed nearly-optimal graphs

We consider the Erdos-Renyi random graph model $G(n,p)$ containing graphs with $n$ nodes
and edges included with probability $p$, independently of every other edge. The expected
number of edges for a graph in $G(n,p)$ is $p\binom{n}{2}$ and the threshold for connectedness is

$$p_c = \frac{\log n}{n}.$$
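Sampling from $G(n,p)$ and evaluating the two quantities above requires only the standard library. The following is a minimal sketch; the function names and the fixed seed are illustrative assumptions.

```python
import math
import random

def sample_erdos_renyi(n, p, rng):
    """Sample an edge list from G(n, p): each of the n(n-1)/2 possible
    edges is included independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def expected_edges(n, p):
    """Expected number of edges, p * binom(n, 2)."""
    return p * n * (n - 1) / 2

def connectivity_threshold(n):
    """Threshold p_c = log(n) / n for connectedness."""
    return math.log(n) / n

rng = random.Random(0)                   # fixed seed for reproducibility
edges = sample_erdos_renyi(50, 0.4, rng)
mean_m = expected_edges(50, 0.4)         # 490 edges on average
p_c = connectivity_threshold(50)         # about 0.078
```

For $n = 50$ the values $p = .4$, $.6$, and $.8$ used in the experiments are all far above the connectedness threshold.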

There are several results on the spectrum of the graph Laplacian for Erdos-Renyi graphs,
especially in the limit $n \to \infty$, see Chung and Radcliffe (2011); Oliveira (2009). Finite
sized random graphs are less well-understood. The algebraic connectivity of Erdos-Renyi,
Watts-Strogatz, and Barabasi-Albert random graphs has been studied numerically in Jamakovic
and Uhlig (2007). The algebraic connectivity of a Watts-Strogatz graph is known to have
a phase transition Olfati-Saber et al. (2007).

We will utilize the following elementary upper bound on the algebraic connectivity, 
analogous to (7), derived using a concentration inequality. 

Proposition 5.1 Let $\epsilon > 0$ and assume $n$ to be even. With probability at least
$1 - \epsilon$, the algebraic connectivity, $\lambda_2$, of an Erdos-Renyi graph $G(n,p)$ satisfies

$$\lambda_2 \le np + 4n^{-2}\sqrt{2\log(1/\epsilon)}. \qquad (12)$$

Proof Choose any subset $U \subset V$ with $|U| = \frac{n}{2}$. Equation (6) implies that
$\lambda_2 \le \frac{4C}{n}$ where $C := |\mathrm{cut}(U, U^c)| \sim B(\frac{n^2}{4}, p)$. For $a > 0$, we
compute

$$\mathrm{pr}(\lambda_2 \ge np + a) \le \mathrm{pr}(4C/n \ge np + a) = \mathrm{pr}(C - pn^2/4 \ge an/4) \le \exp(-a^2 n^4/32)$$

where the last inequality is due to Hoeffding. Setting $a = 4n^{-2}\sqrt{2\log(1/\epsilon)}$, we
find that $\mathrm{pr}(\lambda_2 \ge np + a) \le \epsilon$ as desired. ■



Remark 3 For odd n, Prop. 5.1 holds, except n(n 2 — 1) 2 replaces 4re 2 in (12). 

For a random graph $G(n,p)$, the number of edges $m \sim B(N,p)$ where $N := n(n-1)/2$.
Thus, $\mathbb{E}[m] = pN$ and we may restate (12) as: with probability at least $1 - \epsilon$,

$$\lambda_2 \le \frac{2\,\mathbb{E}[m]}{n-1} + 4n^{-2}\sqrt{2\log(1/\epsilon)}. \qquad (13)$$

Indeed, the first term on the right hand side of (13) matches the right hand side of (7).
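The correspondence between the leading terms of (12) and (13) follows by direct substitution of $\mathbb{E}[m] = pN$ with $N = n(n-1)/2$:

```latex
np \;=\; \frac{n\,\mathbb{E}[m]}{N}
   \;=\; \frac{n\,\mathbb{E}[m]}{n(n-1)/2}
   \;=\; \frac{2\,\mathbb{E}[m]}{n-1}.
```

This is the bound $2m/(n-1)$ of (7) with the number of edges $m$ replaced by its expectation.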

In Figure 2, we plot, for $n = 50$ (left) and $n = 100$ (right) and $p = .4$ (blue), $p = .6$
(red), and $p = .8$ (green) the value of $m$ vs. $\lambda_2$ for 5,000 randomly generated
Erdos-Renyi graphs. The mean values obtained are indicated by circles. We use the greedy
algorithm described in §3.1 (see Algorithm 1) with initial graph taken to be the path with
$n$ vertices, $P_n$, to compute nearly-optimal graphs with $n$ nodes and $m$ edges. The solid
black line in Figure 2 represents the value of $\lambda_2$ for these graphs. Finally, the dashed
blue line in Figure 2 represents the upper bound on $\lambda_2$ given in (7) (compare also to
(13)).

We observe in Figure 2 that nearly-optimal graphs have values which are indeed close to
the upper bound on the algebraic connectivity, indicating (i) the upper bound is nearly tight
and (ii) the greedy heuristic (Algorithm 1) produces graphs which are nearly-optimal. We
also observe that the algebraic connectivity of nearly-optimal graphs is significantly better
than the values for an average Erdos-Renyi random graph.

Figure 2: Algebraic connectivity, $\lambda_2$, as a function of $m$ for 50- and 100-node graphs.
The dashed blue line represents the upper bound on $\lambda_2$ given in (7). The solid black
line represents the nearly-optimal value of $\lambda_2$. Finally, for $p = .4$ (blue), .6 (red),
and .8 (green) we give a scatter plot of $(m, \lambda_2)$ for 5,000 randomly generated
Erdos-Renyi graphs. The mean values obtained are indicated by circles. See §5.2.

5.3 2011-12 NCAA Division I football schedule 

In this section, we study the 2011-12 NCAA Division 1 football schedule downloaded from
Massey Ratings$^2$. As discussed in §1, the NCAA Division 1 Football League is divided into
the Football Bowl Subdivision (FBS) and Football Championship Subdivision (FCS)$^3$. The
FBS is further decomposed into 12 conferences and the FCS into 15. Of the 246 teams in
Division 1, 120 belong to FBS and 126 belong to FCS. Lafayette College is a member of 
FBS, however every opponent of Lafayette during the 2011-12 season was a member of the 
FCS. For our purposes, it is more convenient to reclassify Lafayette as a member of FCS 
and thus, in what follows, FBS has 119 teams and FCS has 127. There were m = 1430 
games amongst the Division 1 teams and m = 693 games amongst the FBS teams. 
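The team and game counts above already give an upper bound on the algebraic connectivity of any schedule with the same size, via the bound $2m/(n-1)$ in (7). The arithmetic can be sketched as follows; the function name is a hypothetical helper and the inputs are the counts stated in the text.

```python
def degree_bound(n, m):
    """Upper bound 2m/(n-1) on the algebraic connectivity, from (7),
    for a schedule with n teams and m games."""
    return 2 * m / (n - 1)

bound_fbs = degree_bound(119, 693)      # FBS only: about 11.75
bound_div1 = degree_bound(246, 1430)    # all of Division 1: about 11.67
```

These values serve as reference points: the closer the algebraic connectivity of a schedule is to this bound, the less room remains for improvement at a fixed number of games.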

5.3.1 Data visualization via spectral clustering 

We use the data visualization method described below to demonstrate that NCAA Division 
1 teams primarily play against other teams within their own conference. We then show that 
this clustering of teams by conference results in the graph having poor algebraic connectivity. 

We first use normalized spectral clustering (Shi and Malik, 2000) to detect communities
within the teams. This, in turn, relies on the k-means algorithm, where k is the desired
number of communities (27 for Division 1 and 12 for Division 1 FBS). Then, using the
Matlab toolbox described in Traud et al. (2009), the Fruchterman-Reingold algorithm finds
an optimal placement of the communities and the Kamada-Kawai algorithm is used for
the placement of nodes within each community. The mean within-cluster sum of
point-to-centroid distances for the k-means clustering obtained for the Division 1 and
Division 1 FBS data is 0.147 and 0.133, respectively.

2. http://masseyratings.com/scores.php?t=11590&s=107811&all=1&mode=2&format=0

3. These were formerly known as Division 1-A and 1-AA, respectively.

[Figure 3 legend (conference coloring): ACC, Big 12, Big East, Big Sky, Big South, Big Ten,
Colonial Athletic Association, Conference USA, Great West, IA Independents, IAA Independents,
Ivy League, MEAC, Mid-American, Missouri Valley, Mountain West, Northeast, Ohio Valley,
Pac-12, Patriot League, Pioneer Football League, SEC, SWAC, Southern, Southland, Sun Belt,
WAC.]

Figure 3: 2011-12 NCAA Division 1 (FBS and FCS) football schedule. Graph
representation of the schedule via spectral clustering by games. Top: vertices represent
teams, edges represent games, and coloring indicates conference membership. Bottom:
community detection of teams (represented using pie graphs) reveals that teams
primarily play within their own conference. The dashed lines indicate an edge
cut which is discussed in the text. See §5.3.

[Figure 4 legend (conference coloring): ACC, Big 12, Big East, Big Ten, Conference USA,
IA Independents, Mid-American, Mountain West, Pac-12, SEC, Sun Belt, WAC.]

Figure 4: 2011-12 NCAA Division 1 (only FBS) football schedule. Graph
representation of the schedule via spectral clustering by games. Top: vertices represent
teams, edges represent games, and coloring indicates conference membership. Bottom:
community detection of teams (represented using pie graphs) reveals that teams
primarily play within their own conference. See §5.3.
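The clustering step described above can be illustrated on a toy schedule. The following is a minimal sketch, not the Matlab toolbox used in the paper: it assumes the generalized eigenproblem formulation of normalized spectral clustering (L v = lambda D v) followed by a plain Lloyd-style k-means with a deterministic farthest-point initialization. The function names (`normalized_spectral_embedding`, `kmeans`, `toy_schedule`) and the toy graph of three four-team "conferences" are hypothetical and chosen only for illustration.

```python
import numpy as np

def normalized_spectral_embedding(adj, k):
    """Rows hold the first k generalized eigenvectors of L v = lambda D v."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.diag(deg) - adj
    # Symmetric form D^{-1/2} L D^{-1/2}; eigh returns eigenvalues in ascending order.
    sym = d_inv_sqrt[:, None] * lap * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(sym)
    return d_inv_sqrt[:, None] * vecs[:, :k]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def toy_schedule(n_conf=3, size=4):
    """Each conference plays a round robin; one game links consecutive conferences."""
    n = n_conf * size
    adj = np.zeros((n, n))
    for c in range(n_conf):
        members = range(c * size, (c + 1) * size)
        for i in members:
            for j in members:
                if i != j:
                    adj[i, j] = 1.0
    for c in range(n_conf):
        i = c * size
        j = ((c + 1) % n_conf) * size + 1
        adj[i, j] = adj[j, i] = 1.0
    return adj

adj = toy_schedule()
labels = kmeans(normalized_spectral_embedding(adj, k=3), k=3)
```

On this toy graph the detected communities coincide with the conferences, which is the qualitative behavior reported for the NCAA schedules.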

In Figures 3 and 4, we plot the 2011-12 NCAA Division 1 and Division 1 FBS football
schedules, respectively. In Figures 3 (top) and 4 (top), the vertices represent teams, the
edges represent games, and each vertex (team) is colored by conference membership. In
Figures 3 (bottom) and 4 (bottom), the vertices represent the spectrally clustered
communities and the edges represent the community interactions. We observe from Figures 3
and 4 that the teams primarily play within their own conference; we discuss the
implications of this below.

We next compare the value of the algebraic connectivity for these schedules with that of
schedules drawn from Erdős-Rényi random graphs and of proposed nearly-optimal schedules.

5.3.2 Comparison of NCAA Division 1, Erdős-Rényi random, and nearly-optimal schedules

In the introduction, we noted that there are several common scalar measures of the variance
of the least-squares ranking estimate, three of which are given in (3). In this section, we
compare these measures for the NCAA Division 1, Erdős-Rényi random, and nearly-optimal
schedules.

More concretely, let $w$ be a given schedule (defining a graph on $n$ vertices) and define
the graph Laplacian $\Delta_w := B^T [\mathrm{diag}(w)] B$. Define the following three
functions of $w$:

$$J_E(w) := \lambda_2(\Delta_w). \tag{14a}$$

[The definitions of $J_A(w)$ in (14b) and $J_D(w)$ in (14c) are illegible in the source; the
only surviving fragment is the index set $i \colon \lambda_i \neq 0$.]

Note that, as defined, it is desirable to maximize $J_E$, $J_A$, and $J_D$ in (14), whereas
it is desirable to minimize the quantities defined in (3).

For the Division 1 and Division 1 FBS schedules, we compute the measures of schedule
quality given in (14) and record them in Table 2. We also plot $J_E(w)$, given in (14a), in
Fig. 5 as a red diamond. We next describe the schedules against which we compare the
Division 1 and Division 1 FBS schedules in Table 2 and Fig. 5.

The expected number of edges for a $G(n,p)$ Erdős-Rényi random graph is $pN$, where
$N := \binom{n}{2}$. To compare with the football schedules, we take $p = m/N$ and consider
the family of random graphs $G(n, m/N)$. For $n = 119$ and $m = 693$, we choose
$p = m/N \approx 0.0987$, which is approximately 2.5 times the threshold for connectivity,
$p_c = \log(n)/n \approx 0.0402$. For $n = 246$ and $m = 1430$, we choose
$p = m/N \approx 0.0475$, which is approximately 2.1 times the threshold for connectivity,
$p_c = \log(n)/n \approx 0.0224$. In Table 2, we tabulate the expected values of the three
quantities given in (14) for $G(n, m/N)$ graphs, obtained by averaging over a sample of
1000 graphs. Similar to §5.2, in Fig. 5 we give a scatter plot of $(m, \lambda_2)$ for
$G(n, m/N)$ graphs and indicate the mean values with a blue circle.

As in §5.2, we again use the greedy algorithm described in §3.1 (see Algorithm 1) to
compute graphs with $n$ nodes and $m$ edges which nearly maximize $J_E = \lambda_2$. We then
evaluate all three quantities given in (14) for these graphs and tabulate the values in
Table 2. The solid black line in Fig. 5 is the best value of $J_E = \lambda_2$ obtained.
Finally, the dashed blue line in Fig. 5 represents the upper bound on $\lambda_2$ given in (7).
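The Erdős-Rényi baseline described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for Table 2: it assumes unit weights for every game, uses an arbitrary orientation for the rows of the incidence matrix (the Laplacian does not depend on it), and averages over only 20 samples instead of 1000 to keep the run short. The function names are hypothetical.

```python
import math
import numpy as np

def incidence_matrix(n, edges):
    """Arc-vertex incidence matrix B: one row per game, one column per team."""
    B = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        B[row, i], B[row, j] = -1.0, 1.0
    return B

def algebraic_connectivity(n, edges, w=None):
    """Second-smallest eigenvalue of the Laplacian B^T diag(w) B (unit weights by default)."""
    B = incidence_matrix(n, edges)
    w = np.ones(len(edges)) if w is None else np.asarray(w, dtype=float)
    lap = B.T @ (w[:, None] * B)
    return np.linalg.eigvalsh(lap)[1]  # eigenvalues are returned in ascending order

def erdos_renyi_edges(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 pairs becomes an edge with probability p."""
    iu, ju = np.triu_indices(n, k=1)
    keep = rng.random(iu.size) < p
    return list(zip(iu[keep].tolist(), ju[keep].tolist()))

# FBS setting from the text: n = 119 teams, m = 693 games.
n, m = 119, 693
N = math.comb(n, 2)        # number of possible pairings
p = m / N                  # edge probability matching the expected number of games
p_c = math.log(n) / n      # connectivity threshold quoted in the text

rng = np.random.default_rng(0)
samples = [algebraic_connectivity(n, erdos_renyi_edges(n, p, rng)) for _ in range(20)]
mean_lambda2 = float(np.mean(samples))
```

Replacing `n, m = 119, 693` with `n, m = 246, 1430` gives the corresponding quantities for the full Division 1 setting.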

We observe in Fig. 5 and Table 2 that the schedules which nearly maximize
$J_E(w) = \lambda_2$ have significantly larger values of $J_E$ than the NCAA Division 1 and
Division 1 FBS schedules (6.630 versus 0.7015 for Division 1, and 7.142 versus 1.725 for
Division 1 FBS). In fact, the NCAA schedules have worse values than schedules associated
with Erdős-Rényi random graphs of the same size. Furthermore, Table 2 shows that schedules
which nearly maximize $J_E$ also have larger values of $J_A$ and $J_D$. That is, the
schedules which are good in the sense of E-optimality are also good in the sense of A- and
D-optimality as defined in (3).

The reason for the relatively poor value of $J_E(w) = \lambda_2$ for the NCAA Division 1 and
Division 1 FBS schedules can be understood from Figures 3 and 4, which are described
in §5.3.1. These figures reveal that teams primarily play within their own conference.
This results in a small edge cut between a conference (or set of conferences) and its vertex
complement, which, by (6), implies a small algebraic connectivity. For example, the edge cut
indicated by the dashed line in Fig. 3 (entire NCAA Division 1 schedule) results in an upper
bound on the algebraic connectivity of 1.297. The edge cut obtained by considering the set
of teams in the SWAC conference yields an upper bound of 1.043. Both of these bounds are
already less than the expected value of $\lambda_2$ for Erdős-Rényi random graphs of
comparable size (compare with the top part of the first column in Table 2). To summarize,
the NCAA primarily schedules games amongst teams within the same conference, and this
reduces the informativeness of the rankings.
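The edge-cut argument can be checked numerically on a small example. The sketch below assumes that the bound in (6) is the standard Rayleigh-quotient estimate $\lambda_2 \le n\,|\partial S| / (|S|\,|V \setminus S|)$ for an unweighted graph, where $|\partial S|$ is the number of edges leaving the vertex set $S$; equation (6) itself is not reproduced in this section, so this correspondence is an assumption. The two five-team "conferences" and the function names are hypothetical, and the values 1.297 and 1.043 quoted above are not recomputed here.

```python
import numpy as np

def laplacian(n, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency matrix)."""
    lap = np.zeros((n, n))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    return lap

def cut_upper_bound(n, edges, subset):
    """n * |cut(S)| / (|S| * |V minus S|), an upper bound on lambda_2."""
    s = set(subset)
    cut = sum(1 for i, j in edges if (i in s) != (j in s))
    return n * cut / (len(s) * (n - len(s)))

# Two conferences of five teams: a round robin within each, two games between them.
conf_a, conf_b = range(0, 5), range(5, 10)
edges = [(i, j) for conf in (conf_a, conf_b) for i in conf for j in conf if i < j]
edges += [(0, 5), (1, 6)]

lam2 = np.linalg.eigvalsh(laplacian(10, edges))[1]
bound = cut_upper_bound(10, edges, conf_a)  # 10 * 2 / (5 * 5) = 0.8
```

Although every team plays at least four games, the two out-of-conference games cap the algebraic connectivity at 0.8, which mirrors the effect described for the NCAA schedules.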





Schedule                      J_E(w) in (14a)   J_A(w) in (14b)   J_D(w) in (14c)
Div. 1 FBS and FCS            0.7015            8.780             2.363
Erdős-Rényi, n = 246          2.892             9.681             2.358
E-optimal design, n = 246     6.630             10.71             2.403
Div. 1 FBS                    1.725             9.634             2.372
Erdős-Rényi, n = 119          3.497             9.911             2.361
E-optimal design, n = 119     7.142             10.92             2.402



Table 2: A comparison of the three objective functions defined in (14) for the Division 1
and Division 1 FBS schedules, Erdős-Rényi random schedules, and schedules which
nearly maximize $J_E(w) = \lambda_2$. Schedules which nearly maximize $J_E(w) = \lambda_2$
also have larger values of $J_A$ and $J_D$ than the comparison schedules. See §5.3.



6. Discussion and future directions 

We have applied methods from optimal experiment design to provide a new framework for
designing more informative schedules, which reduce the variance in ranking. At the heart
of this framework is an optimization problem (2) where the inner problem is to determine
the unbiased ranking for a given schedule and the outer problem is to design the schedule
which minimizes the variance of the ranking. We illustrate this method for the least-squares
ranking estimate, demonstrating that in this case, the outer problem decouples from the
inner problem and reduces to an eigenvalue optimization problem (11). For the E-optimal
schedule, the eigenvalue optimization problem is to maximize the second-smallest eigenvalue
of the graph Laplacian, referred to as the algebraic connectivity. Graph theoretic results
then describe characteristics of graphs with large algebraic connectivity.

[Figure 5 panels. Left: algebraic connectivity for n = 119 nodes. Right: algebraic
connectivity for n = 246 nodes. Horizontal axes: number of edges, m.]

Figure 5: A comparison of $J_E(w) = \lambda_2$ defined in (14a) for the Division 1 and
Division 1 FBS schedules, Erdős-Rényi random schedules, and schedules which nearly
maximize $\lambda_2$. The red diamonds represent the 2011 NCAA Division 1 (right)
and Division 1 FBS (left) football schedules. The solid black lines represent the
nearly-optimal values of $\lambda_2$ obtained for $n = 119$ (left) and $n = 246$ (right).
The dashed blue lines represent the upper bound on $\lambda_2$ given in (7). The blue
dots represent a scatter plot of $(m, \lambda_2)$ for 1,000 randomly generated Erdős-Rényi
graphs, $G(n, m/N)$. The mean values are indicated by blue circles. See §5.3.

There are several applications in which improved scheduling can benefit ranking. In
particular, we have demonstrated that graphs can be generated which represent schedules
that are more informative than the 2011 NCAA Division 1 football schedule. More generally,
the scheduling methods developed here could be implemented in any situation where
there is a large number of items to be ranked and a relatively small number of pairwise
comparisons available. This includes, for instance, a film festival where a large number of
films must be reviewed in a short amount of time by a small number of reviewers (Xu et al.,
2011). Additionally, the scheduling method proposed here could be used to determine
which pairwise comparison data to collect in modern internet and e-commerce applications,
where the collection of data can have an associated cost.

In the case of NCAA Division 1 football, we demonstrated in §5.3 and Table 2 that the
nearly-optimal schedule in the sense of E-optimality is also a good schedule in the sense of
A- and D-optimality; the choice of the real-valued scalar function $f$ defined in (2) does
not strongly affect the optimal schedule (see Remark 1).

The schedule design methodology advocated in Eq. (11) is flexible in the following two
senses: (i) The optimal schedules contain symmetry with respect to permutations in the 
seeding of the teams. This problem has been studied previously for tournaments; see the 
discussion in §2. (ii) The optimal schedule is not time dependent and thus the scheduling of 
future games does not depend on past game performances, i.e., the schedule is completely 
known before the season begins and the games may be played in any order. These properties 
can be exploited in the further design of the schedule. 

In this paper, we have focused on scheduling for improved rankings, neglecting several
other influential factors, including travel limitations and preferences for games between
particular teams, e.g., "rival teams". There are two simple methods which may be
employed to accommodate these additional factors. First, travel limitations or other
financial aspects of gameplay can be incorporated either by adding a penalization term in
(11) or by incorporating additional weights into the norm used to compute $\lambda_2$ in
(11). Second, schedules with games between particular teams may be obtained by explicitly
adding these edges to the input graph of the greedy Algorithm 1 for computing
nearly-optimal schedules.
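The second method can be sketched as follows. Algorithm 1 is not reproduced in this section, so the code below substitutes a common greedy heuristic as a stand-in: starting from the required games, it repeatedly adds the pairing whose endpoints differ most in the Fiedler vector of the current Laplacian. Whether this matches Algorithm 1 in detail is an assumption, and the function names and the small example (8 teams, 12 games, two required "rivalry" games) are hypothetical.

```python
import itertools
import numpy as np

def laplacian(n, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency matrix)."""
    lap = np.zeros((n, n))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    return lap

def fiedler_vector(lap):
    """Eigenvector for lambda_2, kept orthogonal to the constant vector.

    Adding 2 * ones moves the constant eigenvector to eigenvalue 2n, above every
    Laplacian eigenvalue (at most n), so the smallest remaining eigenpair is lambda_2.
    """
    n = lap.shape[0]
    _, vecs = np.linalg.eigh(lap + 2.0 * np.ones((n, n)))
    return vecs[:, 0]

def greedy_schedule(n, m, required_edges):
    """Grow a schedule with m games that contains every required game."""
    edges = {tuple(sorted(e)) for e in required_edges}
    candidates = set(itertools.combinations(range(n), 2)) - edges
    while len(edges) < m and candidates:
        v = fiedler_vector(laplacian(n, edges))
        best = max(sorted(candidates), key=lambda e: (v[e[0]] - v[e[1]]) ** 2)
        edges.add(best)
        candidates.remove(best)
    return sorted(edges)

rivalries = [(0, 1), (2, 3)]
schedule = greedy_schedule(8, 12, rivalries)
lam2 = np.linalg.eigvalsh(laplacian(8, schedule))[1]
```

Because the required games are part of the initial graph, they are guaranteed to appear in the final schedule, and the remaining games are chosen around them.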

We are interested in extending this work to nonlinear ranking methods, including robust
estimators (Osting et al., 2012), random walker methods (Callaghan et al., 2007),
Perron-Frobenius eigenvalue methods (Keener, 1993; Langville and Meyer, 2012), and Elo
methods (Glickman, 1995; Langville and Meyer, 2012).